Exploratory data analysis example - 1C Company

This notebook shows some exploratory analysis of retail sales data, in the context of predicting future sales numbers of items at different retail outlets.

This data was provided by the Russian software publisher and retailer 1C Company, for a Kaggle competition in which the challenge is to predict monthly sales for specific products in specific shops.

Table of contents

Import packages and load data

Import python packages

Load data

First we'll list the provided files.

We'll load all the files into the workspace as pandas dataframes. "sales_train" will be renamed to "train" for convenience.

Show table contents

Have a quick look at the train data.

Merge tables

It looks like the training data has been structured as normalized tables for efficiency. For convenience we can merge the training data into a single table.
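A minimal sketch of the merge, using toy versions of the tables (the real data is loaded from the competition CSVs; the column names follow the provided files):

```python
import pandas as pd

# Toy versions of the normalized tables
train = pd.DataFrame({"shop_id": [0, 0, 1], "item_id": [10, 11, 10],
                      "item_cnt_day": [1.0, 2.0, 1.0]})
items = pd.DataFrame({"item_id": [10, 11], "item_category_id": [3, 4]})
categories = pd.DataFrame({"item_category_id": [3, 4],
                           "item_category_name": ["Movies", "Games"]})
shops = pd.DataFrame({"shop_id": [0, 1], "shop_name": ["Moscow TC", "Omsk TC"]})

# Left-join the lookup tables onto the sales rows, one key at a time
merged = (train
          .merge(items, on="item_id", how="left")
          .merge(categories, on="item_category_id", how="left")
          .merge(shops, on="shop_id", how="left"))
```

Left joins keep every sales row even if a lookup key were missing, so the row count of the training data is preserved.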

Initial exploration and data cleaning

Check datatypes

Check the datatypes of the columns.

Most fields have an appropriate datatype, although the numeric fields could potentially be downcast to save memory.

Convert the date column to datetime

The date field is formatted as a string; we can convert it to the datetime dtype to enable extra datetime features, such as grouping by weeks or months.
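A short sketch of the conversion, assuming day-first dd.mm.yyyy strings as in the raw sales file:

```python
import pandas as pd

dates = pd.Series(["02.01.2013", "15.03.2013"])   # raw day-first strings

# An explicit format avoids day/month ambiguity and speeds up parsing
dates = pd.to_datetime(dates, format="%d.%m.%Y")

# The datetime dtype enables features such as grouping by month
months = dates.dt.to_period("M")
```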

Check for missing values

Now that we have prepared the dataframe, we can check it for missing values.
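A one-line check, shown here on a toy dataframe with a deliberately missing price:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"item_id": [10, 11, 10],
                   "item_price": [999.0, np.nan, 499.0]})

# Count of missing values per column
missing = df.isna().sum()
```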

First look at the merged training dataframe

There aren't any missing values to worry about, so we can have a quick look at the merged dataframe to get an idea of its contents.

It looks like the rows are individual sales counts for specific shop-item combinations on specific days, with the number of item sales in the column "item_cnt_day". There is a negative value of item_cnt_day in row 2, so it looks like the data also includes entries for returned items.

The pandas "describe" method gives a good summary of the range of values in each numerical column. We'll round values to the nearest integer for clarity and append the number of unique values for each column.

Initial summary

From this we can see:

Cleaning rows with outliers & invalid values

Some quick investigation finds that only one otherwise unremarkable entry has a negative price, which should be safe to remove.

The entries with very high price or sales values are so few in number that they can be looked at individually and deleted if appropriate. The very highest-valued entries are custom orders for large numbers of items, which are probably best handled by removing entries with values above an appropriate threshold.

There are 7356 rows (0.25% of the total) with apparently valid negative item_cnt_day values. We delete these because they cause problems when aggregating sales by month, such as months with negative sales totals.

Examine shop names and sales over time

The training data contains multiple shop_ids, not all of which are in the test set. To get an overview we can create a figure which plots total sales for each shop by month and displays the shop names in Russian and English. A table with a list of translated shop names is used for this.

Correct shop labels and remove data from unusual shops

A closer look at these plots finds several data cleaning issues, such as:

We will merge the duplicate shops and remove all data from shops not in the test month, for simplicity.

(A similar check of item categories finds no data quality issues)

Remove duplicate items

Finally, we'll check for remaining duplicate entries in the training data.

There are only 5 duplicate entries, but the fact that 4 of them are for the same product suggests that they are errors, so we might as well drop them.

Create a model training dataset by reshaping the data

Aggregate the data by month

The competition challenge was to predict monthly sales totals, so we should create a training set that follows the test format by summing sales for each month.

The format of the test items is a list of all possible combinations of shops and items for shops and items that recorded at least one sale in the test month, i.e. the Cartesian product of these shops and items. We recreate this by summing the sales for the Cartesian product of active shops and items sold in each month.
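A sketch of this aggregation on a toy daily-sales table (the real data uses the merged training table; here month 0 has two shops and two items, month 1 one of each):

```python
import itertools
import pandas as pd

train = pd.DataFrame({"date_block_num": [0, 0, 1],
                      "shop_id": [0, 1, 0],
                      "item_id": [10, 11, 11],
                      "item_cnt_day": [1.0, 2.0, 3.0]})

# For each month, take the Cartesian product of the shops and items active
# in that month, then attach summed sales (0 where no sale was recorded)
grids = []
for block, month in train.groupby("date_block_num"):
    grids.append(pd.DataFrame(
        list(itertools.product([block], month["shop_id"].unique(),
                               month["item_id"].unique())),
        columns=["date_block_num", "shop_id", "item_id"]))
grid = pd.concat(grids, ignore_index=True)

monthly = (train.groupby(["date_block_num", "shop_id", "item_id"],
                         as_index=False)["item_cnt_day"].sum()
                .rename(columns={"item_cnt_day": "item_cnt_month"}))
train_monthly = grid.merge(monthly,
                           on=["date_block_num", "shop_id", "item_id"],
                           how="left").fillna({"item_cnt_month": 0})
```

The left join against the Cartesian-product grid is what introduces the zero-sales rows that the target distribution plots below have to contend with.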

As before, merge the provided items, categories and shops tables.

Show the head of the table to check it looks ok.

We can look at the distributions of the target and other features, and look for interesting patterns of relationships between variables.

The aim of this exploration should be to find patterns in the data that could help predict the target value, and identify the types of prediction model that are appropriate.

Distribution of the target feature

We plot an initial histogram of the target item_cnt_month feature, with a smoothed distribution estimate.

The distribution clearly has a very large peak close to zero. Creating sales counts for all items for all shops in each month might lead to lots of entries with zero sales, so we should check what proportion of item counts are now zero.

The distribution of non-zero values would be obscured if all values were plotted together, so we plot the distribution of targets with zero values removed.

Even with zeros removed, the target is very bottom-heavy, with only around 5% of values above 1, although a small number of items sell much more than this.

The skewness of the distribution makes linear models unsuitable for predicting future sales of items, as assumptions for linear models will not be met. Instead, non-linear models such as decision trees or k-nearest neighbor models would be more suitable.

Item price distribution

We can also plot the distribution of item prices. For this, we take the mean price of each item in the months in which it was sold. To do this we create a table which summarizes monthly data across all shops.

Prices are also highly skewed towards zero. We can display the distribution more clearly by using a log scale.

Here we can see that the price distribution is approximately lognormal with a peak slightly below 10.

Joint distribution of item prices and sales

Finally, we can also plot the joint distribution of monthly items sales and mean prices, again using a log scale for clarity.

This reveals no strong overall associations between price and sales, although associations may exist in subgroups of the data.

Outliers are also apparent.

Plotting total sales counts per month shows clear downwards and seasonal trends. However the mean number of each item sold per month (which is what is to be predicted) shows a less pronounced downwards trend.

Seasonally decompose data

Mean sales per item can also be decomposed into seasonal and continuous trends using the statsmodels package. This shows a clear yearly seasonal trend (particularly a peak around the winter holidays) and an overall downwards trend that can be assumed to be related to the rise of internet and digital-only sales.

The overall sales trend is clearly downwards, but there are differences at the item category level. Compare the following trend and seasonal decomposition plots for games for the PS3 and PS4.

Sales by item category

Each item is assigned one of 80+ categories which identify what kind of product it is and what format it is for. Information about these categories could be predictive because different types of item are likely to sell in different amounts.

First we plot mean sales and revenue per item in each category across all shops, for the last year of sales data.

Most obviously this plot shows that items in one particular category - "Gifts - bags, albums, mousepads" - have a much higher sales volume than items in other categories, but comparatively little revenue. This category contains few items and denotes low-cost items such as promotional bags and mousemats.

Economic and predictive importance may be better represented by plotting summed rather than mean values in each category.

Plotting summed sales shows that movies and games are the highest-selling categories overall, with PS4 games accounting for the most revenue. As well as being predictive in itself, this information could be useful for deciding what categories to prioritize when building predictive models or allocating promotional resources.

Sales by shop category

Summed sales can also be plotted when data is grouped according to shop.

Some shops, particularly those in Moscow, sell many more items overall than others. All else being equal, an item sold in these shops can be predicted to sell in greater quantities.

Shop - category sales profile and decomposition of sales

Some shops sell more than others, but are there differences in the relative quantities of items from each category that shops sell?

Each shop's summed sales per category form a profile, which is plotted here as a heatmap.

PCA decomposition and clustering of shops

The vertical stripes in this heatmap indicate shops that differ from the mean in some way, but high-dimensional data like this can be difficult to understand without some kind of summary.

Principal component analysis (PCA) lets us decompose high-dimensional data into a low-dimensional representation that makes it easier to get an overview of patterns in the data. Although the shop-category sales counts do not have the ideal (normal) distribution for use with PCA, we can still do this to try to gain some insight into the similarities and differences between shops.

The explained variance plot shows that around 85% of the differences between the shops can be explained by two linear components, and that almost all of the shops lie approximately on a straight line when plotted on these dimensions.

Insight into what these components correspond to can be gained by plotting the component loadings, which map the component scores back to the original data. We do this here for the two components plotted above, sorting the elements of each component by descending magnitude.
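A sketch of the decomposition on a toy shop-by-category count matrix (random Poisson counts stand in for the real summed sales):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy matrix: rows are shops, columns are item categories
counts = rng.poisson(20, size=(10, 6)).astype(float)

pca = PCA(n_components=2)
scores = pca.fit_transform(counts)   # low-dimensional shop positions

# Rows of components_ are the loadings that map scores back onto the
# original category axes; sorting by magnitude shows which categories
# dominate each component
order = np.argsort(-np.abs(pca.components_[1]))
```

`pca.explained_variance_ratio_` gives the per-component explained variance used in the plot above.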

Looking first at component 2, we see that this is mostly highly weighted on "digital" categories, i.e. non-physical online downloads. The fact that shop 55 is so far from the other shops on this dimension is explained by this being the ID of an online store, and highlights that shop 55 has a very different sales profile to the other shops and that it may help to handle this differently when making predictions.

The fact that the non-digital shops lie more or less on a straight line in the principal component representation above indicates that the non-digital shops mainly differ in the magnitude of their sales volumes rather than in the types of items which they sell.

Engineered features

Sales as a function of item age

The approximate age of an item at the time of sale can be calculated by subtracting the first date or month on which it was sold from the date of the current sale.
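In month units this is a grouped transform, sketched here on a toy monthly table:

```python
import pandas as pd

monthly = pd.DataFrame({"date_block_num": [0, 1, 2, 1, 2],
                        "item_id": [10, 10, 10, 11, 11]})

# First month in which each item appears, broadcast back to every row
first_month = monthly.groupby("item_id")["date_block_num"].transform("min")
monthly["item_age"] = monthly["date_block_num"] - first_month
```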

Total monthly item sales as a function of item age is plotted below for all items, and separately for items in two representative categories.

Plotting total monthly item sales as a function of item age shows that items tend to sell most when they are new and then decline to a plateau about a year later. The slightly lower sales in the first month compared to the second are attributable to items not always being available for the whole of their first month.

It is also evident that this trend for items to sell most shortly after their release is more pronounced for some categories, such as movies, than for others.

Predicting sales from historical sales

Even when taking the decline of sales volume over time into account, it seems likely that items that sell well in one month are likely to also sell well in the following month. A column can be created which contains the sales figures from the previous month for the same shop-item combination.
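One way to build such a lag column is a self-merge shifted one month forward, sketched on a toy monthly table:

```python
import pandas as pd

monthly = pd.DataFrame({"date_block_num": [0, 1, 2, 0, 1],
                        "shop_id": [0, 0, 0, 1, 1],
                        "item_id": [10, 10, 10, 10, 10],
                        "item_cnt_month": [3.0, 2.0, 4.0, 1.0, 5.0]})

# Shift a copy one month forward so each row picks up the previous
# month's sales for the same shop-item combination
lag = monthly.copy()
lag["date_block_num"] += 1
lag = lag.rename(columns={"item_cnt_month": "item_cnt_prev_month"})
monthly = monthly.merge(lag, on=["date_block_num", "shop_id", "item_id"],
                        how="left").fillna({"item_cnt_prev_month": 0})
```

Rows in an item's first month have no history and are filled with 0 here; they are the "new item" case handled separately below.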

We can create a regression plot of sales counts as a function of the previous month's sales, for a sample of items which are at least a month old. For clarity, we use log scales on the axes and plot the estimate of the central tendency (mean) of item_cnt_month.

Moving average of historical sales

Sales in individual months are mostly low-valued and tend to be noisy. To reduce this noise, windowing methods can be used to calculate a historical mean as a weighted sum of the sales from multiple previous months. An example showing multiple types of window is shown below.

We create windowed 12-month average and exponential moving average (in which recent months are weighted more heavily than less recent months) sales count features and display regression plots below.
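Both window types can be computed per shop-item group, with a one-month shift so only past months inform the feature; a sketch on a single toy group:

```python
import pandas as pd

sales = pd.DataFrame({"shop_id": [0] * 5, "item_id": [10] * 5,
                      "date_block_num": range(5),
                      "item_cnt_month": [4.0, 2.0, 0.0, 6.0, 2.0]})

g = sales.groupby(["shop_id", "item_id"])["item_cnt_month"]
# 12-month simple moving average of past months only
sales["sma_12"] = g.transform(
    lambda s: s.shift(1).rolling(12, min_periods=1).mean())
# Exponential moving average; span is an illustrative choice
sales["ema"] = g.transform(
    lambda s: s.shift(1).ewm(span=6, adjust=False).mean())
```

The `shift(1)` is what prevents the current month's target from leaking into its own feature.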

Predictive information about new items

While previous months' sales are useful predictors, they are not available for new items, so alternative information must be used to make predictions.

The item category and shop id fields can be used to determine mean item sales counts for items in their first month of sales. This is plotted below for the last year of sales data.

Again, the mean new item sales for item category - shop combinations can also be calculated.

Shop averages, category averages, shop-category interactions

Text features

In addition to the information contained in the item categories and shop identities, all items have an associated item_name text feature, which contains a short description of the item that often includes things such as its title, format (e.g. PS4 or PS3) and language.

To aid extraction of information from text it is often useful to clean it of irrelevant special characters, excess blank spaces, and low-information words such as "the", and to convert all text to lowercase. This can be performed with regular expression operations, as demonstrated here:
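A minimal cleaning helper of this kind (the item name below is an illustrative example; `\w` matches Cyrillic characters in Python 3):

```python
import re

def clean_name(text):
    """Lowercase, strip special characters and collapse repeated spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # drop punctuation/special chars
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()

cleaned = clean_name("Grand Theft Auto V [PS4, русская версия]")
```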

One way that the item_name field could be used is by extracting individual words or n-grams (groups of sequential words) and treating them as individual binary categories. Doing this creates a very large number of features so some kind of filtering will likely be necessary, such as specifying a minimum number of occurrences for a word feature. Another feature selection technique is to filter features according to some metric of relevance to the target variable, such as correlation.

Below is an example of the 1- and 2-grams produced from a single item_name text string.
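A hand-rolled sketch of the extraction (at scale, a vectorizer such as scikit-learn's CountVectorizer with `ngram_range=(1, 2)` does the same job):

```python
def ngrams(text, n):
    """Return the n-grams (tuples of n consecutive words) of a string."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

name = "grand theft auto v"
unigrams = ngrams(name, 1)
bigrams = ngrams(name, 2)
```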

Grouping similar items with similar names

Items with different item_ids are often related to each other, such as being different versions of the same video game or movie, and so are likely to have related sales figures. This can be taken advantage of by grouping similar items together based on their item_names.

The Python package TheFuzz implements fuzzy string matching to measure the similarity of sequences of words. The following code uses this with an alphabetical sorting of item names to group related items together. The resulting item name group can be used like other categorical features.
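The idea can be sketched with the standard library's difflib.SequenceMatcher standing in for TheFuzz's fuzz.ratio (which scores a scaled variant of the same ratio); item names and the 0.7 threshold are illustrative:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()

# Alphabetical sorting places related names next to each other, so a
# single pass comparing neighbours is enough to assign group ids
names = sorted(["fifa 15 ps3", "fifa 15 ps4", "minecraft ps3"])
groups = [0]
for prev, cur in zip(names, names[1:]):
    groups.append(groups[-1] if similarity(prev, cur) > 0.7
                  else groups[-1] + 1)
```

Here the two FIFA entries land in one group and Minecraft starts a new one.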

The example list of items after grouping shows that related items (e.g. the same video game for different consoles) are assigned to the same group.

Mean encoding and non-stationarity of category sales

Categorical features such as name groups and shop ids have useful predictive information, but the relationship between category values and the target variable is not consistent over the training data, as items, item categories and shops increase or (more often) decline in popularity over time.

Theoretically, a model could learn the interactions between individual category values and time periods, but for categorical features with high numbers of values, such as item identities and name groups, this could require very complex models and cause problems with overfitting.

To save the predictive model from having to learn the relationships between individual category values and specific time periods, a useful solution is to turn categorical variables into numerical ones by re-encoding each value of the category as the mean of the target variable over items with that value, within some time window before the time of the current item. As with individual items, different time windows can be used.

Shown here are example windowed mean encodings for 3 values of the item category name feature, using 3 different temporal windows.
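A sketch of one such encoding for a single category value (a 3-month window, assuming one row per category-month sorted by month; the shift keeps the current month's target out of its own encoding):

```python
import pandas as pd

df = pd.DataFrame({"date_block_num": [0, 1, 2, 3],
                   "item_category_name": ["Games"] * 4,
                   "item_cnt_month": [4.0, 2.0, 6.0, 0.0]})

# Mean of the target over up to 3 previous months per category value,
# shifted one month so the encoding only uses past information
g = df.groupby("item_category_name")["item_cnt_month"]
df["cat_mean_3m"] = g.transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean())
```

Repeating this with different window lengths yields the 3 encodings compared in the table.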

Define a function to fit and return a lightgbm regressor with or without early stopping

Split train and validation sets from the feature matrix, month 33 used as validation set

These hyperparameters were found by using the hyperparameter optimization framework Optuna to optimize hyperparameters for the validation set.

Fit the booster using early stopping with the validation set